Venice is drowning - final report

Abstract

The main objective of the project is to analyze tide-gauge measurements from the Venice lagoon and to produce predictive models whose performance is evaluated on forecast horizons ranging from one hour up to one week.

For this purpose, three models, spanning both linear and machine-learning approaches, are tested:

  • ARIMA (AutoRegressive Integrated Moving Average);
  • UCM (Unobserved Components Model);
  • LSTM (Long Short-Term Memory).

Datasets

Two datasets are the basis for the project pipeline:

  • the “main” dataset contains tide-level measurements (in cm, relative to a reference level) recorded by a sensor in the Venice lagoon between 1983 and 2018;
  • a second dataset holds meteorological variables, namely rainfall (mm), wind direction at 10 meters (degrees) and wind speed at 10 meters (m/s), for the period between 2000 and 2019.

The tide-level dataset is assembled from the annual historical datasets made public by the city of Venice, in particular by the Centro Previsioni e Segnalazioni Maree. The meteorological data, instead, have been provided on request by ARPA Veneto.

All the preprocessing operations regarding parsing, inspection and the final union of the cited datasets are available in the following scripts:

  • parsing_tides_data builds the tidal dataset by importing and unifying each annual dataset;
  • inspection contains a series of preliminary inspections of the aforementioned data;
  • preprocess_weather_data_2000_2019 contains the preprocessing operations for the weather dataset;
  • parsing_tides_weather summarizes the procedure used to handle missing data in the weather dataset and contains the merging operation that produces the final weather dataset.
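The merge step of the pipeline can be sketched in a few lines of pandas. This is only an illustration with hypothetical column names and a toy sample, not the actual contents of the scripts above:

```python
import pandas as pd

# Toy stand-ins for the two preprocessed sources: hourly tide levels (cm)
# and hourly weather records, both keyed by timestamp.
tides = pd.DataFrame(
    {"datetime": pd.date_range("2018-01-01", periods=4, freq="h"),
     "level_cm": [42, 55, 61, 48]}
)
weather = pd.DataFrame(
    {"datetime": pd.date_range("2018-01-01", periods=4, freq="h"),
     "rain_mm": [0.0, 0.2, 0.0, 1.4],
     "wind_speed_ms": [3.1, 2.8, 4.0, 5.2]}
)

# Inner join keeps only timestamps present in both sources.
merged = tides.merge(weather, on="datetime", how="inner")

# Keep only 2010-2018, as done after preprocessing (a no-op on this sample).
merged = merged[(merged["datetime"] >= "2010-01-01") &
                (merged["datetime"] <= "2018-12-31")]
print(merged.shape)  # (4, 4)
```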

As a deliberate choice, due to time and computational constraints, only the data from 2010 to 2018 are kept after preprocessing.

Data inspection

During the preprocessing phase, some descriptive visualizations regarding the main time series are produced in order to inspect its characteristics.

Figure 1
Figure 1: Time series visualization with autocorrelation and partial autocorrelation plots


Fig.1 reports the entire time series together with the autocorrelation and partial autocorrelation plots. Observing Fig.2, the tidal phenomenon appears to be approximately normally distributed. This is worth noting because, for Gaussian processes, the concepts of strict and weak stationarity are equivalent.

Figure 2
Figure 2: Time series distribution


During the preliminary inspection of the historical data, one of the investigations concerns the stationarity in mean and variance of the considered time series. Regarding the former, Fig.1 suggests that the series has a stationary mean. This hypothesis is further supported by the output of the Augmented Dickey-Fuller test, which confirms the in-mean stationarity of the tidal phenomenon.

Figure 3
Figure 3: Output of the ADF test


On the other hand, the latter characteristic (stationarity in variance) can be verified by observing Fig.4, where the daily average of the tide levels is plotted against the daily standard deviation. No clear trend emerges in this plot, supporting the conclusion that the two quantities are uncorrelated.

Figure 4
Figure 4: Visualization of stationarity in variance
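The daily aggregation behind Fig.4 can be sketched with a pandas resample; this is a toy series with constant variance (a 24-hour harmonic plus noise), not the project's data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2018-07-01", periods=24 * 90, freq="h")
hours = np.arange(len(idx))
# Toy hourly tide series: a diurnal harmonic plus noise of constant variance.
level = pd.Series(30 * np.sin(2 * np.pi * hours / 24)
                  + rng.normal(0, 5, len(idx)), index=idx)

# Daily mean level vs. daily standard deviation, as plotted in Fig.4.
daily = level.resample("D").agg(["mean", "std"])
corr = daily["mean"].corr(daily["std"])
# Under variance stationarity no clear relation between the two is expected.
print(abs(corr) < 0.3)  # True
```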


Figure 5
Figure 5: Frequency visualization using periodogram
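The periodogram of Fig.5 highlights the dominant tidal frequencies. A minimal scipy sketch on a toy series with an M2-like (~12.42 h) and a diurnal component recovers the dominant period:

```python
import numpy as np
from scipy.signal import periodogram

# Toy hourly series with semi-diurnal (M2-like) and diurnal harmonics.
n = 24 * 365
t = np.arange(n)
y = 40 * np.sin(2 * np.pi * t / 12.42) + 15 * np.sin(2 * np.pi * t / 24)

# fs = 1 sample per hour, so frequencies come out in cycles/hour.
freqs, power = periodogram(y, fs=1.0)
peak = freqs[np.argmax(power)]
print(round(1 / peak, 1))  # dominant period in hours: 12.4
```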


Models

The produced models cover two areas: a purely statistical one, with linear models such as ARIMA and UCM, and a machine-learning one, through the investigation of an LSTM model. The preparation and implementation of the models are presented below, followed by a results section allowing a quick comparison of the models’ performance on a test set defined a priori. It is worth highlighting the different subsets of the whole dataset exploited in each area:

  • for the linear models, the training set consists of the last six months of 2018, from July to December;
  • for the machine-learning approach, which can handle more data in a reasonable computational time, the training set covers the period between January 2010 and December 2018.

The test set, previously extracted, is common to the different approaches and consists of the last two weeks of December 2018, i.e. from 17/12/2018 23:00:00 to 31/12/2018 23:00:00.
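The split described above is a plain cutoff on the timestamp index; a pandas sketch on a stand-in series:

```python
import pandas as pd

idx = pd.date_range("2018-07-01 00:00:00", "2018-12-31 23:00:00", freq="h")
series = pd.Series(range(len(idx)), index=idx)  # stand-in for the tide levels

# Hold out the last two weeks of December 2018.
split = pd.Timestamp("2018-12-17 23:00:00")
train_set = series[series.index <= split]
test_set = series[series.index > split]
print(len(test_set))  # 336 hours = 14 days
```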

Figure 6
Figure 6: Train and test data representation


Concerning the linear models, two strategies are considered: the former consists of integrating the meteorological variables and the lunar motion into the main analysis, while the latter performs a kind of harmonic analysis by reusing some of the most common principal periodic components extracted in other studies in the field. The adaptation of such components to our series is done with oce, an R package that supports oceanographers by providing utilities to inspect and process oceanographic data files.

Figure 7
Figure 7: Interactive plot representing lunar motion between 2010 and 2018


The second strategy, as anticipated, concerns the extraction of principal periodic components from a sea-level time series commonly treated in the literature, and the reuse of those components as regressors for the Venice tide-level series. The oce package provides a function called tidem that fits a model in terms of sine and cosine components at the indicated tidal frequencies, with amplitude and phase calculated from the resulting coefficients on the sine and cosine terms. tidem can fit up to 69 components, but we focused on 8 of them, in particular:

  • M2, main lunar semi-diurnal with a period of ~12 hours;
  • S2, main solar semi-diurnal (~12 hours);
  • N2, lunar-elliptic semi-diurnal (~13 hours);
  • K2, lunar-solar semi-diurnal (~12 hours);
  • K1, lunar-solar diurnal (~24 hours);
  • O1, main lunar diurnal (~26 hours);
  • SA, solar annual (~24*365 hours);
  • P1, main solar diurnal (24 hours).
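A tidem-style harmonic design matrix can be sketched directly: one sine and one cosine column per constituent, whose fitted coefficient pair yields the amplitude and phase. The periods below are standard approximate values for the constituents listed above, not output from the project's fit:

```python
import numpy as np

# Approximate constituent periods in hours (standard textbook values).
periods = {"M2": 12.42, "S2": 12.00, "N2": 12.66, "K2": 11.97,
           "K1": 23.93, "O1": 25.82, "SA": 8766.0, "P1": 24.07}

t = np.arange(24 * 180)  # hourly time index over half a year
# One sine and one cosine regressor per constituent.
X = np.column_stack(
    [f(2 * np.pi * t / p) for p in periods.values() for f in (np.sin, np.cos)]
)
print(X.shape)  # (4320, 16)
```

Because these regressors are pure functions of time, they can be evaluated over the forecast horizon as well, which is what makes them usable for out-of-sample prediction.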
Figure 8
Figure 8: Interactive filtering plot for the extracted components

ARIMA

Both linear models use the data between 25/06/2018 00:00:00 and 17/12/2018 23:00:00 as training set. This choice is driven by the need to keep the models’ fitting time manageable: with more data, the computational time would have grown prohibitively. Both ARIMA and UCM are implemented in R, in particular with the forecast and KFAS packages.

As a first approach to the forecasting task, we trained two ARIMA models: the former uses as regressors the meteorological data provided by ARPA Veneto together with the lunar motion obtained with PyEphem, while the latter uses the 8 harmonics from oce and tidem.

Starting from the first model, it is worth noting that the meteorological data have been standardized and the lunar motion is transformed as follows, recalling the general shape of the gravitational force (proportional to the inverse of the squared distance):

\[\begin{equation} lunar\_var = \frac{1}{lunar\_motion^2} \end{equation}\]
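Numerically the transformation is trivial; a sketch with hypothetical Earth-Moon distances (PyEphem reports Moon.earth_distance in astronomical units):

```python
import numpy as np

# Hypothetical Earth-Moon distances in AU, increasing over three epochs.
lunar_motion = np.array([0.00257, 0.00263, 0.00269])

# Inverse squared distance: the regressor grows as the Moon gets closer.
lunar_var = 1.0 / lunar_motion**2
print(lunar_var[0] > lunar_var[-1])  # True: closer Moon, larger value
```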

The initial impact of the variable is substantially insignificant, as can be observed in Fig.9: the correlations and partial correlations maintain their magnitude and the fit on the training data is weak.

Figure 9
Figure 9: First output from ARIMA 1


After several attempts, guided by the lags shown in the ACF and PACF plots in combination with the AICc and the Mean Absolute Percentage Error (MAPE), a highly parameterized model of the form (3,1,3)(1,1,3)[24] has been reached. Although the autocorrelation has not been completely absorbed, the Box-Ljung test indicates that it is no longer significant for the first few hours. The fit on the training set also improves considerably, and the performance on the test set proves to be quite good, as illustrated below. The autocorrelation and the fitting performance are visible in Fig.10.

Figure 10 Figure 10b
Figure 10: Final output from ARIMA 1


As anticipated, the second ARIMA model uses as input regressors the 8 harmonics extracted with tidem and oce. Since they are functional forms based solely on time, we can also obtain them for the forecast horizon. An example of these harmonics is shown in Fig.8 in an interactive fashion. The effect of the harmonics is already visible in the basic model, i.e. the one without autoregressive or moving-average components: the autocorrelation plots and the model fit on the training set are visible in Fig.11.

Figure 11
Figure 11: Initial output from ARIMA 2


Also in this case, following the autocorrelation plots and the trend of the AICc, we added autoregressive and moving-average components until obtaining the final model, (3,0,2)(1,0,0)[24] with drift. It is worth noting that this model is decidedly less parameterized than the previous one: the inserted harmonics explain a large part of the series by themselves, as can be seen in Fig.11. After the fitting, the resulting autocorrelation plots appear as reported in Fig.12. Even in this case it was not possible to absorb all the autocorrelation, but the Box-Ljung test ensures that the residuals are white noise up to the third lag, and the performance on both the training and test sets is quite good.

Figure 12 Figure 12b
Figure 12: Final output from ARIMA 2


UCM

As a second type of linear model, a UCM has been implemented in addition to the ARIMA models already discussed. Numerous attempts were made for this model class. First, a model was built using only the standard UCM components: trigonometric and dummy seasonalities, and a trend in all its variants (LLT, IRW, RW) together with a stationary cycle. Then the meteorological variables were used together with the lunar cycle, to try to increase the portion of the time series explained by the model. Finally, another attempt used the harmonics computed with the oce package. The best-performing model turned out to be the one composed of the harmonics, inserted directly as components, together with a trend component, specifically a random walk.

This confirms once again how much the harmonics contribute to explaining the tide levels. As for the ARIMA models, the results of the UCM model will be shown later.

LSTM

(Description of the LSTM procedure to be added here.)
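While the project's actual LSTM implementation is not reported in this section, a minimal numpy sketch of a single LSTM cell step (the standard gated formulation, with hypothetical sizes) illustrates the mechanism such a model relies on:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: input, forget and output gates plus a candidate
    cell state, each of hidden size n (standard formulation)."""
    n = h.shape[0]
    z = W @ x + U @ h + b                 # stacked pre-activations, shape (4n,)
    i = 1 / (1 + np.exp(-z[:n]))          # input gate
    f = 1 / (1 + np.exp(-z[n:2 * n]))     # forget gate
    o = 1 / (1 + np.exp(-z[2 * n:3 * n])) # output gate
    g = np.tanh(z[3 * n:])                # candidate cell state
    c_new = f * c + i * g                 # cell state update
    h_new = o * np.tanh(c_new)            # hidden state (the cell's output)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                        # e.g. 3 input features, 4 units
W = rng.normal(size=(4 * n_hid, n_in)) * 0.1
U = rng.normal(size=(4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):      # roll the cell over a 5-step window
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

In practice a framework such as Keras or PyTorch would be used, with the hidden state at the end of an input window feeding a dense layer that emits the 1-hour to 1-week forecasts.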

Results

Figure X
Figure X:

Conclusions

Dario Bertazioli, Fabrizio D’Intinosante

2020-01-25